---
title: "Non-linear modelling"
description: "Beyond simple linear regression"
date: "2025-02-13"
format:
  html:
    page-layout: full
    html-math-method: katex
citation:
  url: "https://c-monaghan.github.io/posts/2025/"
execute:
  warning: false
editor: source
---

```{r setting-up, echo=FALSE}
set.seed(321)

# Packages
library(ggplot2)
library(patchwork)

# Setting active theme
theme_set(
  theme_classic() +
    theme(plot.title = element_text(
      hjust = 0.5, size = 12, colour = "grey20"
    ))
)
```

For regression, the standard linear model:

$$y_i \sim N(\mu_i, \; \sigma^2) \qquad \mu_i = \alpha + \sum^K_{k=1}\beta_kx_{ik}$$

is a widely used approach to describe the relationship between a response variable $y_i$ and a set of predictors $x_{ik}$. Linear models are popular due to their simplicity, ease of implementation, and interpretability. The coefficients provide direct insights into the relationship between each predictor and the response, making it straightforward to infer the impact of individual variables.

```{r plotting-non-linearity}
#| fig-width: 8
#| label: fig-non-linear
#| fig-cap: Comparison of linear and nonlinear relationships. Each panel illustrates a different relationship between x and y (linear, quadratic, cubic, and sinusoidal).

# Generating some data
x <- seq(-10, 10, length.out = 100)
y_linear <- x + rnorm(length(x), mean = 0, sd = 2)
y_quadratic <- x^2 + rnorm(length(x), mean = 0, sd = 4)
y_cubic <- x^3 + rnorm(length(x), mean = 0, sd = 8)
y_sinusoidal <- sin(x) + rnorm(length(x), mean = 0, sd = 0.3)

data <- data.frame(
  x = x, y_linear = y_linear,
  y_quadratic = y_quadratic, y_cubic = y_cubic,
  y_sinusoidal = y_sinusoidal
)

# Plotting
p_1 <- data |>
  ggplot(aes(x = x, y = y_linear)) +
  geom_point(size = 1.5, colour = "steelblue4") +
  labs(title = "Linear relationship", y = "y")

p_2 <- data |>
  ggplot(aes(x = x, y = y_quadratic)) +
  geom_point(colour = "steelblue4") +
  labs(title = "Quadratic relationship", y = "y")

p_3 <- data |>
  ggplot(aes(x = x, y = y_cubic)) +
  geom_point(colour = "steelblue4") +
  labs(title = "Cubic relationship", y = "y")

p_4 <- data |>
  ggplot(aes(x = x, y = y_sinusoidal)) +
  geom_point(colour = "steelblue4") +
  labs(title = "Sinusoidal relationship", y = "y")

# Arrange plots in a grid layout
(p_1 + p_2) / (p_3 + p_4)
```

However, the standard linear model relies on the assumption of linearity, which is often an approximation of the true underlying relationship. In many real-world scenarios, this assumption may not hold (@fig-non-linear), leading to significant limitations in predictive accuracy. When the true relationship is nonlinear, a linear model may fail to capture complex patterns, resulting in poor performance.
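We can also quantify this misspecification by comparing how much variance a straight line explains for each simulated outcome. Below is a minimal, self-contained sketch (it regenerates data as in the chunk above; the chunk name is new):

```{r linear-rsq-check}
set.seed(321)
x <- seq(-10, 10, length.out = 100)
y_linear <- x + rnorm(length(x), mean = 0, sd = 2)
y_quadratic <- x^2 + rnorm(length(x), mean = 0, sd = 4)

# R^2 of a straight-line fit to each outcome
summary(lm(y_linear ~ x))$r.squared    # high: the line is well specified
summary(lm(y_quadratic ~ x))$r.squared # near zero: the best line is almost flat
```

Because $x$ is symmetric about zero here, $x$ and $x^2$ are uncorrelated, so the best-fitting line through the quadratic data is nearly flat and its $R^2$ is close to zero, even though $x$ completely determines the mean of $y$.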
We can see this in the figure below where our straight-line equation captures the data well in the top-left plot, but fails to do so for the remaining plots.

```{r linear-fit}
#| fig-width: 8
#| label: fig-fit-non-linear
#| fig-cap: We can see that our linear model does not do a good job at capturing the curvature in our data.

p_1 <- p_1 + geom_smooth(
  method = "lm", colour = "firebrick4", linewidth = 1.25, se = FALSE
)
p_2 <- p_2 + geom_smooth(
  method = "lm", colour = "firebrick4", linewidth = 1.25, se = FALSE
)
p_3 <- p_3 + geom_smooth(
  method = "lm", colour = "firebrick4", linewidth = 1.25, se = FALSE
)
p_4 <- p_4 + geom_smooth(
  method = "lm", colour = "firebrick4", linewidth = 1.25, se = FALSE
)

(p_1 + p_2) / (p_3 + p_4)
```

As such, various extensions of linear models have been developed which relax the assumption of linearity while maintaining as much interpretability as possible.

## Polynomial regression

Polynomial regression is a natural extension of the standard linear model, where the relationship between the predictor $X$ and the response $Y$ is modeled as an $n$-th degree polynomial. This allows the model to capture curvature in the data while still using a linear framework for estimation. For instance, a second-degree polynomial could be written as follows:

$$y_i \sim N(\mu_i, \; \sigma^2) \qquad \mu_i = \alpha + \beta_1x_{1i} + \beta_2x_{1i}^2$$

Here, $x_1, x_1^2$ are the polynomial terms, and $\beta_1, \beta_2$ are their corresponding coefficients. We can continue this process for any finite polynomial degree such that:

$$\mu_i = \alpha + \sum^K_{k = 1} \beta_kx_i^k$$

Despite the inclusion of nonlinear terms, polynomial regression remains linear in terms of estimation because the regression function is linear in the unknown parameters $(\alpha, \beta_1, \dots, \beta_K)$. This key property allows us to leverage the techniques of multiple linear regression for estimation and inference.
Specifically, polynomial regression can be implemented by treating $x, \dots, x^k$ as distinct independent variables in a multiple regression framework. As a result, the computational and inferential challenges of polynomial regression can be fully addressed using well-established methods, such as ordinary least squares.

#### Refitting the Model with a Second-Degree Polynomial

Let's refit a model to our data using a **second-degree polynomial**. To recap, such polynomial models take the form:

$$y_i \sim N(\mu_i, \; \sigma^2) \qquad \mu_i = \alpha + \beta_1x_{1i} + \beta_2x_{1i}^2$$

```{r quadratic-fit}
#| fig-width: 8
#| label: fig-refit-non-linear
#| fig-cap: A second-degree polynomial fit (green) compared to the linear model (red). The polynomial model captures the curvature in the data, providing a much better fit.

p_2 + geom_smooth(
  formula = y ~ poly(x, 2), method = "lm",
  colour = "forestgreen", linewidth = 1.25, se = FALSE
)
```

#### Exploring Higher-Degree Polynomials

While a second-degree polynomial often provides a good balance between flexibility and simplicity, higher-degree polynomials can be used to capture more complex patterns in the data. However, care must be taken to avoid overfitting, where the model becomes too complex and starts capturing noise rather than the underlying trend.

Below are examples of fitting **third-degree** and **ninth-degree** polynomials to the same data:

```{r high-degree-fit}
#| fig-width: 8
#| label: fig-overfitting
#| fig-cap: A third-degree (left) and ninth-degree (right) polynomial fit (green) compared to the linear model (red). As you can see, each model fits the data well; however, they may run the risk of overfitting on new data.

p_3 <- p_3 + geom_smooth(
  formula = y ~ poly(x, 3), method = "lm",
  colour = "forestgreen", linewidth = 1.25, se = FALSE
)
p_4 <- p_4 + geom_smooth(
  formula = y ~ poly(x, 9), method = "lm",
  colour = "forestgreen", linewidth = 1.25, se = FALSE
)

p_3 + p_4
```

## Splines and basis functions

Polynomial regression can be seen as a special case of basis function regression, a more general framework for modeling nonlinear relationships. In basis function regression, we represent the nonlinear function of the predictor variable $X$ as a linear combination of simpler functions, known as basis functions. These basis functions allow us to flexibly model complex relationships while retaining the computational advantages of linear regression.

For example, in the case of a nonlinear regression with one predictor variable $X$ and a normally distributed outcome $Y$, the basis function regression model can be written as:

$$\mu_i = f(x_i) = \alpha + \sum^K_{k=1} \beta_k \phi_k(x_i)$$

where $\phi_k(x_i)$ are the basis functions, which are simple deterministic functions of $X$.

In polynomial regression, the basis functions are defined as:

$$\phi_k(x_i) = x_i^k$$

For example:

- $\phi_1(x_i) = x_i$ is a linear term
- $\phi_2(x_i) = x_i^2$ is a quadratic term
- $\phi_3(x_i) = x_i^3$ is a cubic term
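The same `lm()` machinery extends beyond polynomial bases. As a minimal sketch (using the `splines` package, which ships with R; the chunk name and the `df` value are illustrative choices, not from the text above), a cubic B-spline basis can be fit exactly like a multiple regression:

```{r spline-basis-sketch}
library(splines)

set.seed(321)
x <- seq(-10, 10, length.out = 100)
y_sinusoidal <- sin(x) + rnorm(length(x), mean = 0, sd = 0.3)

# Each column of bs(x, df = 10) is one basis function phi_k(x);
# lm() then estimates alpha and the beta_k by ordinary least squares
fit <- lm(y_sinusoidal ~ bs(x, df = 10))
length(coef(fit)) # intercept plus 10 basis coefficients
```

Because the basis columns enter `lm()` like any other predictors, estimation, standard errors, and prediction all carry over unchanged from the multiple regression framework described above.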